1 Project introduction

For another scientist:

This is a bioinformatic pipeline that starts with multiple fastq files provided by sequencing company JonahVentures. The project is Lionfish gut content metabarcoding of COI on mtDNA. Each fastq files are the sequences found within the gut of one individual fish. Reference file contains sequences from visually identified fishes collected in the same transect as the lionfish.

For any person off the street:

I am using DNA to identify partially digested material found in the stomachs of lionfish to understand what they are eating. This code will identify the DNA sequences from the stomachs based off of sequences from visually identified species collected in the same area as the lionfish.

3 Set Up

Load the necessary packages

library (tidyverse)
install.packages('insect')
library (insect)
#if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("dada2")
library (digest)
library (seqinr)
library(lubridate)
library(ShortRead)
library(dada2); packageVersion("dada2")
library(dplyr)

4 Check the Starting Quality

Path to fastq files before trimming primers

path <- "../Data/"
list.files(path)

Sorting forward and reverse reads based on file name pattern

fnFs <- sort(list.files(path, pattern=".R1.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern=".R2.fastq", full.names = TRUE))

sample.names <- str_replace(basename(fnFs), ".1.R1.fastq","")

This shows a plot of the quality of the forward reads

In gray-scale is a heat map of the frequency of each quality score at each base position. The mean quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line).

plotQualityProfile(fnFs[1:5])

This plot shows quality of the reverse reads

plotQualityProfile(fnRs[1:5])

5 Use CutAdapt to remove Primers

This is the path to cut adapt. I don’t have permission to run on raven but I can run this on my laptop with appropriate directory

cutadapt <- "../Applications/cutadapt.exe"

These are the COI primers. Taken from methods txt file from Jonah ventures. The Function allOrients makes forward, compliment, reverse, and reverse compliment for both primers. This is to look for the primers in the sequences in any orientation (PrimerHits, the next function/ chunck)

FWD <- "GGWACWGGWTGAACWGTWTAYCCYCC"
REV <- "TANACYTCNGGRTGNCCRAARAAYCA" #N's replaced I's from Jonah

allOrients <- function(primer) {
  require(Biostrings)
  dna <- DNAString(primer)
  orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna), RevComp = reverseComplement(dna))
  return(sapply(orients, toString))
}

FWD.orients <- allOrients(FWD)
FWD.orients
REV.orients <- allOrients(REV)
REV.orients

PrimerHits function is looking for the primers in all orientations in the sequences.

primerHits <- function(primer, fn){
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  return(sum(nhits > 0))
}

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs[[1]]))

This checks that we can use cutadapt

system2(cutadapt, args = "--version")

Running cutadapt. Permission denied on raven but works on laptop

path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))

R1.flags <- paste0("-g", " ^", FWD)
R2.flags <- paste0("-G", " ^", REV)

for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags,
                             "-m 1", #discards reads having length zero bp
                             "--discard-untrimmed",
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], #output files
                             fnFs[i], fnRs[i])) #input files
}

6 Check Quality after Removing Primers

Path to fastq files after trimming primers

path.cut <- "../Data/cutadapt"
list.files(path.cut)

Sorting forward and reverse reads based on file name pattern. save as objects

fnFs_cut <- sort(list.files(path.cut, pattern=".R1.fastq", full.names = TRUE))
fnRs_cut <- sort(list.files(path.cut, pattern=".R2.fastq", full.names = TRUE))

sample.names <- str_replace(basename(fnFs), ".1.R1.fastq","")

This shows a plot of the quality of the forward reads after removing primers

plotQualityProfile(fnFs_cut[1:5])

This plot shows quality of the reverse reads after removing primers

plotQualityProfile(fnRs_cut[1:5])

7 Filter and trim

filtF1s <- file.path(path, "filtered", paste0(sample.names, ".R1.filt.fastq.gz"))
filtR1s <- file.path(path, "filtered", paste0(sample.names, ".R2.filt.fastq.gz"))
names(filtF1s) <- sample.names
names(filtR1s) <- sample.names

out <- filterAndTrim(fnFs_cut, filtF1s, fnRs_cut, filtR1s, truncLen=c(200,200),
                      maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
                      compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE

head(out)

Checking the quality after filtering and trimming

plotQualityProfile(filtF1s)

plotQualityProfile(filtR1s)

errF1 <- learnErrors(filtF1s, multithread=TRUE,verbose = 0)
errR1 <- learnErrors(filtR1s, multithread=TRUE,verbose = 0)

visualise the estimated error rates

plotErrors(errF1, nominalQ=TRUE)

derepF1s <- derepFastq(filtF1s, verbose = TRUE)
derepR1s <- derepFastq(filtR1s, verbose = TRUE)
dadaF1s <- dada(derepF1s, err = errF1, multithread = TRUE)

dadaR1s <- dada(derepR1s, err = errR1, multithread = TRUE)

dadaF1s[[1]]

8 Merge paired reads

mergers <- mergePairs(dadaF1s, filtF1s, dadaR1s, filtR1s, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])

9 Construct sequence table

construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.

seqtab <- makeSequenceTable(mergers)

dim(seqtab)

The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.

#Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))

10 Remove Chimeras

The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences.

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE)

dim(seqtab.nochim)
sum(seqtab.nochim)/sum(seqtab)

11 Track the reads through the pipeline.

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaF1s, getN), sapply(dadaR1s, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)

12 Taxon Assignment

In theory, one command takes the output from above and assigns taxonomy based on a reference file

taxonomy <- assignTaxonomy(
  seqtab.project.miseqrun1,
  "../Data/ALL_Atlantic_2_Feb_2017.fasta",
  taxLevels = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "species"),
  tryRC = FALSE,
  minBoot = 50,
  outputBootstraps = TRUE,
  multithread = TRUE,
  verbose = TRUE
)

However, our input file is not formatted correctly, and it give this error:

Error in assignTaxonomy(seqtab.project.miseqrun1, “../Data/ALL_Atlantic_2_Feb_2017.fasta”, : Incorrect reference file format for assignTaxonomy.

12.1 Set up the Atlantic Fasfa file

Here is the original file

head -n 15 ../Data/ALL_Atlantic_2_Feb_2017.fasta | tail -n 5
## >0089551016__Eucinostomus_
## NNNNNNNNNNNNNNNNNNNNNNNCCTTTACCTCATCTTT-GGTGCCTG-AGCT-GGTATGGT-AGGGACAGCCCTTAGCCTA-CTTATCCGTGCCGAGCTAAGCCAACCCGGTTCTCTCCTAGGTGATGACCAGATCTATAATGTTATTGTTACAGCTCACGCATTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGCTTCGGGAACTGACTAATCCCCCTAATGATTGGGGCACCTGACATAGCATTCCCTCGAATAAATAACATGAGCTTCTGGCTTCTTCCCCCCTCTTTTATTCTTCTCCTAGCCTCTTCAGGCGTAGAAGCAGGGGCCGGTACCGGATGAACAGTTTACCCTCCCCTGGCAGGTAATTTAGCACATGCGGGGGCATCCGTAGACTTAACTATCTTTTCTCTCCACCTTGCTGGGATCTCATCAATCCTAGGGGCAATTAACTTTATTACAACCATTATTAACATAAAACCCCCTGCCATTTCCCAGTACCAAACACCTTTATTCGTTTGAGCTGTTTTAATCACGGCTGTCCTGCTTCTTCTTTCCCTCCCC-GTTCTAGCAGCCGGTATTACTATACTTCTAACAGACCG-AAACTTAAATACTACATTTTTTGACCC-T-GCAG-GGGGAGGTGA--CCCTATCCTCTATCAACATCTCTTCNNNNNNNNNN
## >BZLWA189_06|JQ840067|BZLW4189|Gerreidae
## NNNNNNNNNNNNNNNNNNNNNNNCCTTTACCTCATCTTT-GGTGCCTG-AGCT-GGTATGGT-AGGGACAGCCCTTAGCCTA-CTTATCCGTGCCGAGCTAAGCCAACCCGGTTCTCTCCTAGGTGATGACCAGATCTATAATGTTATTGTTACAGCTCACGCATTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGCTTCGGGAACTGACTAATCCCCCTAATGATTGGGGCACCTGACATAGCATTCCCTCGAATAAATAACATGAGCTTCTGGCTTCTTCCCCCCTCTTTTATTCTTCTCCTAGCCTCTTCAGGCGTAGAAGCAGGGGCCGGTACCGGATGAACAGTTTACCCTCCCCTGGCAGGTAATTTAGCACATGCGGGGGCATCCGTAGACTTAACTATCTTTTCTCTCCACCTTGCTGGGATCTCATCAATCCTAGGGGCAATTAACTTTATTACAACCATTATTAACATAAAACCCCCTGCCATTTCCCAGTACCAAACACCTTTATTCGTTTGAGCTGTTTTAATCACGGCTGTCCTGCTTCTTCTTTCCCTCCCC-GTTCTAGCAGCCGGTATTACTATACTTCTAACAGACCG-AAACTTAAATACTACATTTTTTGACCC-T-GCAG-GGGGAGGTGA--CCCTATCCTCTATCAACATCTCTTCNNNNNNNNNN
## >BZLWA190_06|JQ840070|BZLW4190|Gerreidae

Translate command to make | into ;

tr '|' ';' < ../Data/ALL_Atlantic_2_Feb_2017.fasta \
> ../Data/ALL_Atlantic_2_Feb_2017_edit.fasta

Translate commands to remove returns

tr '
' ';' < ../Data/ALL_Atlantic_2_Feb_2017_edit.fasta \
> ../Data/ALL_Atlantic_2_Feb_2017_edit2.fasta

Sed command to put returns where I want them

sed 's/>/\'$'\n/g' < ../Data/ALL_Atlantic_2_Feb_2017_edit2.fasta \
> ../Data/ALL_Atlantic_2_Feb_2017_edit3.fasta

sed command to put > back at beginning of each line

sed 's/^/>/' < ../Data/ALL_Atlantic_2_Feb_2017_edit3.fasta \
> ../Data/ALL_Atlantic_2_Feb_2017_edit4.fasta

Here is how the referece file looks now

head -n 15 ../Data/ALL_Atlantic_2_Feb_2017_edit4.fasta | tail -n 3
## >SMSA146_09;JQ842451;SMSA7146;Eucinostomus;NNNNNNNNNNNNNNNNNNNNNNNCCTTTACCTCATCTTT-GGTGCCTG-AGCT-GGTATGGT-AGGGACAGCCCTTAGCCTA-CTTATCCGTGCCGAGCTAAGCCAACCCGGTTCTCTCCTAGGTGATGACCAGATCTATAATGTTATTGTTACAGCTCACGCATTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGCTTCGGGAACTGACTAATCCCCCTAATGATTGGGGCACCTGACATAGCATTCCCTCGAATAAATAACATGAGCTTCTGGCTTCTTCCCCCCTCTTTTATTCTTCTCCTAGCCTCTTCAGGCGTAGAAGCAGGGGCCGGTACCGGATGAACAGTTTACCCTCCCCTGGCAGGTAATTTAGCACATGCGGGGGCATCCGTAGACTTAACTATCTTTTCTCTCCACCTTGCTGGGATCTCATCAATCCTAGGGGCAATTAACTTTATTACAACCATTATTAACATAAAACCCCCTGCCATTTCCCAGTACCAAACACCTTTATTCGTTTGAGCTGTTTTAATCACGGCTGTCCTGCTTCTTCTTTCCCTCCCC-GTTCTAGCAGCCGGTATTACTATACTTCTAACAGACCG-AAACTTAAATACTACATTTTTTGACCC-T-GCAG-GGGGAGGTGA--CCCTATCCTCTATCAACATCTCTTCNNNNNNNNNN;
## >0089551017__Atherinidae_;NNNNNNNNNNNNNNNNNNNNNNNCCTCTATCTAGTATTT-GGTGCTTG-AGCC-GGAATGGT-AGGCACTGCCTTAAGCCTT-CTAATTCGGGCAGAATTAAGCCAACCAGGCTCTCTCCTTGGAGATGACCAAATTTACAATGTAATCGTTACAGCACACGCCTTTGTAATAATTTTCTTTATAGTAATACCAATTATGATTGGAGGATTTGGAAACTGACTTATTCCCCTTATGATTGGAGCCCCTGATATGGCATTTCCCCGAATGAACAATATGAGCTTCTGACTCCTCCCTCCCTCTTTTCTTCTCCTTCTTGCATCATCTGGAGTTGAAGCCGGAGCTGGTACAGGCTGAACAGTCTACCCCCCTCTAGCAGGAAACCTAGCTCACGCAGGTGCATCTGTAGATCTTACCATCTTTTCTCTCCACCTAGCAGGTGTTTCATCAATTCTAGGGGCTATCAACTTCATTACAACTATTATTAATATGAAACCCCCTGCAATCTCACAATACCAGACACCCCTGTTCGTTTGAGCTGTCTTAATTACTGCCGTTCTTCTCCTTCTTTCCCTTCCA-GTTCTGGCAGCTGGCATCACTATACTACTAACAGACCG-AAATCTAAATACCACCTTCTTCGACCC-T-GCCG-GAGGGGGTGA--TCCCATTCTGTATCAACATCTCTTCNNNNNNNNNN;
## >0089551020__Ogilbia_;NNNNNNNNNNNNNNNNNNNNNNNCCTCTACCTGGTATTC-GGTGCTTG-GGCA-GGAATAGT-AGGCACAGCACTAAGCCTC-CTTATTCGGGCAGAACTGAGCCAGCCCGGCTCTCTCCTTGGAGATGACCAAATTTATAACGTCATCGTCACAGCCCACGCCTTCGTAATAATTTTCTTTATAGTAATGCCAATCATGATCGGTGGATTTGGAAACTGACTTATTCCCCTGATGATTGGAGCCCCCGATATAGCGTTTCCCCGAATAAATAATATGAGCTTCTGGCTACTACCCCCATCCTTCCTTCTCCTTCTTGCCTCTTCCGGAGTAGAAGCAGGGGCAGGGACCGGCTGAACAGTCTACCCGCCCTTAGCAGGCAACCTCGCCCATGCAGGAGCCTCCGTTGACTTAACCATCTTCTCCCTTCACCTAGCCGGAGTTTCCTCCATCCTAGGAGCAATTAATTTCATCACAACAATTATTAACATGAAACCCCCAGCCATTTCCCAATACCAAACCCCTTTATTCGTGTGAGCCGTGTTAATTACAGCCGTTCTCCTTCTCCTATCCCTCCCG-GTCCTTGCCGCTGGTATCACCATACTATTAACAGACCG-AAATTTAAACACAACATTCTTCGACCC-C-GCTG-GTGGAGGTGA--CCCAATCTTATACCAACACCTATTCNNNNNNNNNN;

assignTaxonomy function should then assign taxonomy after editing the reference file to the appropriate format

taxonomy <- assignTaxonomy(
  seqtab.project.miseqrun1,
  "../Data/ALL_Atlantic_2_Feb_2017_edit4.fasta",
  taxLevels = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "species"),
  tryRC = FALSE,
  minBoot = 50,
  outputBootstraps = TRUE,
  multithread = TRUE,
  verbose = TRUE
)

BUT IT STILL DIDN’T WORK even after editing the reference file. It returns this error:

Error in tax[[1]] : subscript out of bounds

Which appears to mean some sequences in the reference file are shorter than some desired length. So let’s just Blast like normal. A for effort.

13 Blast with defeat

Shout out to Halia cause I took this from her.

Check if the softwear is working and you can access the help menu

../Applications/ncbi-blast-2.13.0+/bin/blastx -h

make the blast database

../Applications/ncbi-blast-2.13.0+/bin/makeblastdb \
-in ../Data/ALL_Atlantic_2_Feb_2017.fasta \
-dbtype nucl \
-out ../Output/Reference_DB

Write output from dada pipline (seqtab.nochim) to a fasta file that can be blasted transpose table

seqtab.nochim_trans <- as.data.frame(t(seqtab.nochim)) %>% rownames_to_column(var = "sequence") %>% 
    rowid_to_column(var = "OTUNumber") %>% mutate(OTUNumber = sprintf("otu%04d", 
    OTUNumber)) %>% mutate(sequence = str_replace_all(sequence, "(-|\\.)", ""))

convert seqtab.nochim_trans to fasta file and double check where it goes in the directory

df <- seqtab.nochim_trans
seq_out <- Biostrings::DNAStringSet(df$sequence)

names(seq_out) <- str_c(df$OTUNumber, df$Supergroup, df$Division, df$Class, 
    df$Order, df$Family, df$Genus, df$Species, sep = "|")

Biostrings::writeXStringSet(seq_out, str_c( "Test_1_ASV.fasta"), compress = FALSE, 
    width = 20000)

examine fasta file

head ../Data/Test_1_ASV.fasta
## >otu0001
## TCTATCAGGCAACCTAGCCCACGCAGGTGCTTCCGTAGACTTAACCATCTTCTCCCTCCACCTGGCAGGTATTTCTTCAATCCTGGGAGCCATCAACTTTATCACTACCATTATTAATATGAAACCCCCTGCCATTTCCCAGTATCAAACCCCTCTTTTCGTATGGGCTGTTCTTATTACTGCCGTACTTCTCCTTCTGTCCCTCCCAGTCTTAGCTGCTGGCATCACAATACTTCTAACTGATCGAAACCTAAACACTACATTCTTTGACCCTGCGGGAGGAGGTGACCCAATCCTTTATCAACACCTGTTC
## >otu0002
## CTTAGCGGGCAATCTTGCCCATGCCGGGGCATCTGTCGACCTAACAATTTTCTCCTTGCACTTAGCAGGCATTTCATCAATCCTGGGGGCAATCAATTTTATTACAACAATTATTAATATAAAACCCCCAGCTATCTCCCAGTACCAAACTCCACTGTTTGTATGAGCTGTCTTAATTACGGCAGTTCTTTTACTTCTTTCGCTCCCAGTCCTTGCCGCCGGTATTACAATACTGCTTACTGATCGAAACCTCAACACCACCTTCTTTGACCCAGCGGGGGGAGGAGACCCAATTCTTTACCAACACCTCTTC
## >otu0003
## TCTATCAGGCAACCTAGCCCACGCAGGTGCTTCCGTAGACTTAACCATCTTCTCCCTCCACCTGGCAGGTATTTCTTCAATCCTGGGAGCCATCAACTTTATCACTACCATTATTAATATGAAACCCCCTGCCATTTCCCAGTATCAAACCCCTCTTTTCGTATGGGCTGTTCTTATTACTGCCGTACTTCTCCTTCTATCCCTCCCAGTCTTAGCTGCTGGCATCACAATACTTCTAACTGATCGAAACCTAAACACCACATTCTTTGACCCTGCGGGAGGAGGTGACCCGATCCTTTATCAACACCTGTTC
## >otu0004
## CTTAGCGGGCAATCTTGCCCATGCCGGGGCATCTGTCGACCTAACAATTTTCTCCTTGCACTTAGCAGGCATTTCATCAATCCTGGGGGCAATCAATTTTATTACAACAATTATTAATATAAAACCCCCAGCTATCTCCCAGTACCAAACTCCACTGTTTGTGTGAGCTGTCTTAATTACGGCAGTTCTTTTACTTCTTTCGCTCCCAGTCCTTGCCGCCGGTATTACAATACTGCTTACTGATCGAAACCTCAACACCACCTTCTTTGACCCAGCGGGGGGAGGAGACCCAATTCTTTACCAACACCTCTTC
## >otu0005
## TGGCTGGAAACCTAGCCCACGCAGGCGCGTCTGTTGACTTAACAATTTTCTCCCTTCACCTTGCAGGTGTTTCTTCGATTCTTGGTGCAATTAATTTTATTACCACAATTATTAATATGAAGCCCCCAGCCATCTCTCAGTATCAAACGCCTTTATTTGTTTGAGCAACCCTAATTACAGCTGTCCTGCTTCTTCTTTCCCTTCCTGTTTTAGCTGCTGGCATCACAATGCTTCTCACAGACCGAAACCTTAACACCACTTTCTTCGACCCGGCAGGAGGAGGAGACCCAATCCTATACCAACACCTCTTC

14 Run Blast

../Applications/ncbi-blast-2.13.0+/bin/blastn \
-query ../Data/Test_1_ASV.fasta \
-db ../Output/Reference_DB \
-out ../Output/Test_1_ASV.tab \
-num_threads 8 \
-max_target_seqs 1 \
-outfmt 6

Examine blast output

head -2 ../Output/Test_1_ASV.tab
wc -l ../Output/Test_1_ASV.tab
## otu0001  DRRA1971_12|DRR110186|Chromis_multilineata  97.500  320 1   6   1   313 371 690 4.75e-154   540
## otu0002  LFG6_01__201780231_ 97.812  320 0   6   1   313 371 690 1.02e-155   545
## 14 ../Output/Test_1_ASV.tab

This is really ugly, but I think it technically worked.

15 End Table

This is the desired output from the bioinformatic pipeline. Some columns were omitted for brevity (for example, the columns with the sequence, kingdom, and phylum). ESVid is the ID of the Estimated Sequence Variance. Class Order, Family, Genus, and Species refer to the taxonomic identification of the ESV. X..Match refers to the percent match. Columns with names following S0434##.# format refer to the sample (individual lionfish gut) Numbers in the matrix refer to how many sequences of the given taxa were found in the sample, with zero being not found in that sample. Number of sequences does not correlate with the abundace or amount of prey from a species in the stomach. We can only infer presence/ absence and frequency of occurance of each taxon.

Taxon Pterois volitans (Red Lionfish) reads removed under the assumption that these reads are from the stomach lining of the individual and not representative of conspecific consumption (canabalism).

Data interpretation: We cannot assume more reads = more of that taxon in the sample. We can look at general presence/ absence or frequency of occurance.

table<-read.csv("../Output/JVB1606-UniCOI-read-data.csv") #Read in Table
table_no_lions <- table[-c(which(table$Genus=="Pterois")),] #Remove Lionfish reads
datatable(table_no_lions[,-c(1,3,4,5,18)]) #View table without a few columns